James Rouse
Introduction:
This project revolves around predicting player performance for the 2023 baseball season. Focusing on individual player hitting data, the goal is to build models that predict a player's potential contribution to scoring runs. The working scenario is that a team is in need of someone who can bat in runs.
Target Variable: Runs Batted In (RBIs) serves as the key performance metric. It signifies a player's impact on their team's runs.
Approach: The project spans data collection, preprocessing, exploratory data analysis (EDA), and model building. By parsing hitting data from the previous 5 seasons, key attributes were extracted, such as batting average, slugging percentage, home runs, and walks.
Building Predictive Models: Employing linear regression along with other machine learning algorithms, models will be constructed to predict RBIs. Training, tuning, and evaluation will lead to models capable of forecasting player performance.
# Importing libraries
import pandas as pd
import numpy as np
from pybaseball import batting_stats_range, team_batting
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_predict
from sklearn.tree import plot_tree
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import statsmodels.api as sm
from sklearn.model_selection import KFold
pd.options.mode.chained_assignment = None
Data Collection: To start, data was fetched with pybaseball, a Python package that provides a convenient interface for accessing baseball-related data, statistics, and information from online sources. The code below uses the function batting_stats_range from pybaseball to create a data frame called batting_data_df, which contains batting data for all MLB players this season.
# Get the current year
current_year = datetime.now().year
today_date = datetime.now().date()
today_str = today_date.strftime('%Y-%m-%d')
# Fetch batting statistics for the current season to date
raw_data = batting_stats_range(start_dt=f'{current_year}-03-30', end_dt=today_str)
# Print the fetched data
batting_data_df = pd.DataFrame(raw_data)
batting_data_df.head()
| | Name | Age | #days | Lev | Tm | G | PA | AB | R | H | ... | SH | SF | GDP | SB | CS | BA | OBP | SLG | OPS | mlbID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CJ Abrams | 23 | 1 | Maj-NL | Washington | 100 | 446 | 398 | 65 | 103 | ... | 1 | 2 | 2 | 20 | 10 | 0.259 | 0.333 | 0.460 | 0.792 | 682928 |
| 2 | José Abreu | 37 | 50 | Maj-AL | Houston | 33 | 112 | 106 | 9 | 14 | ... | 0 | 1 | 4 | 0 | 0 | 0.132 | 0.170 | 0.208 | 0.377 | 547989 |
| 3 | Wilyer Abreu | 25 | 1 | Maj-AL | Boston | 82 | 281 | 251 | 42 | 67 | ... | 0 | 3 | 4 | 7 | 2 | 0.267 | 0.335 | 0.486 | 0.821 | 677800 |
| 4 | Ronald Acuña Jr. | 26 | 67 | Maj-NL | Atlanta | 48 | 217 | 188 | 37 | 46 | ... | 0 | 0 | 4 | 15 | 3 | 0.245 | 0.346 | 0.362 | 0.707 | 660670 |
| 5 | Willy Adames | 28 | 1 | Maj-NL | Milwaukee | 107 | 464 | 408 | 57 | 102 | ... | 0 | 3 | 11 | 12 | 3 | 0.250 | 0.334 | 0.436 | 0.770 | 642715 |
5 rows × 28 columns
Data Cleaning: Ensuring Accurate Analysis
Baseball data often contains errors, missing values, and inconsistencies due to various data collection methods and sources. Cleaning the data removes inaccuracies, fixes missing values, and standardizes variables.
# Getting info about the columns in the dataframe
batting_data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 593 entries, 1 to 616
Data columns (total 28 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    593 non-null    object
 1   Age     593 non-null    int64
 2   #days   593 non-null    int64
 3   Lev     593 non-null    object
 4   Tm      593 non-null    object
 5   G       593 non-null    int64
 6   PA      593 non-null    int64
 7   AB      593 non-null    int64
 8   R       593 non-null    int64
 9   H       593 non-null    int64
 10  2B      593 non-null    int64
 11  3B      593 non-null    int64
 12  HR      593 non-null    int64
 13  RBI     593 non-null    int64
 14  BB      593 non-null    int64
 15  IBB     593 non-null    int64
 16  SO      593 non-null    int64
 17  HBP     593 non-null    int64
 18  SH      593 non-null    int64
 19  SF      593 non-null    int64
 20  GDP     593 non-null    int64
 21  SB      593 non-null    int64
 22  CS      593 non-null    int64
 23  BA      589 non-null    float64
 24  OBP     589 non-null    float64
 25  SLG     589 non-null    float64
 26  OPS     589 non-null    float64
 27  mlbID   593 non-null    int64
dtypes: float64(4), int64(21), object(3)
memory usage: 134.4+ KB
It looks like there are 4 rows with missing values (BA, OBP, SLG, and OPS each have 589 non-null entries out of 593). Since so few of the rows are affected, these rows can just be dropped.
# Dropping NA values
batting_data_df = batting_data_df.dropna()
batting_data_df.duplicated().sum()
0
Because I am predicting RBIs, I can remove variables that won't be needed for the prediction. The ID, days, and name columns, as well as base-running stats, won't be necessary for predicting runs batted in. The ID and name are just labels, and base running occurs after a batter is already on base, where they can't earn any RBIs.
# Dropping unnecessary columns
columns_to_drop = ['mlbID', 'CS', 'SB', '#days', 'Name']
batting_data_df = batting_data_df.drop(columns=columns_to_drop)
For consistency, ease of access, and clarity, the variable names will be converted to snake case.
# Creating a dictionary to map the data.
column_name_mapping = {
'Age': 'age',
'Lev': 'league',
'Tm': 'team',
'G': 'games',
'PA': 'plate_appearances',
'AB': 'at_bats',
'R': 'runs',
'H': 'hits',
'2B': 'double',
'3B': 'triple',
'HR': 'home_run',
'RBI': 'runs_batted_in',
'BB': 'walk',
'IBB': 'intentional_walk',
'SO': 'strikeouts',
'HBP': 'hit_by_pitch',
'SH': 'sacrifice_hit',
'SF': 'sacrifice_fly',
'GDP': 'double_play',
'BA': 'batting_average',
'OBP': 'on_base_percentage',
'SLG': 'slugging_percentage',
'OPS': 'on_base_plus_slugging'
}
batting_data_df.rename(columns=column_name_mapping, inplace=True)
# Print the modified DataFrame
batting_data_df.columns
Index(['age', 'league', 'team', 'games', 'plate_appearances', 'at_bats',
'runs', 'hits', 'double', 'triple', 'home_run', 'runs_batted_in',
'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
'sacrifice_hit', 'sacrifice_fly', 'double_play', 'batting_average',
'on_base_percentage', 'slugging_percentage', 'on_base_plus_slugging'],
dtype='object')
Investigating the team column, since it is of type object, to see what it contains. While investigating, it became clear that some players have multiple teams separated by a comma. Only about 7% of players have played for multiple teams, and because a table will later be joined on team name, these players will simply be dropped.
# Inspecting the unique values of the 'team' column
batting_data_df['team'].unique()
array(['Washington', 'Houston', 'Boston', 'Atlanta', 'Milwaukee',
'Los Angeles', 'Los Angeles,San Francisco', 'Arizona',
'Kansas City', 'Oakland', 'New York', 'Colorado', 'Chicago',
'Miami', 'Tampa Bay', 'St. Louis', 'Cleveland',
'Seattle,Tampa Bay', 'Miami,San Diego', 'San Diego', 'Detroit',
'Pittsburgh', 'San Francisco', 'Baltimore', 'Toronto',
'Cincinnati', 'Chicago,Miami', 'Los Angeles,Toronto', 'Seattle',
'Philadelphia', 'Minnesota', 'St. Louis,Tampa Bay', 'Texas',
'Miami,New York', 'Boston,Chicago', 'New York,Oakland',
'Miami,Pittsburgh', 'Chicago,Kansas City', 'Houston,Oakland',
'Cincinnati,Seattle', 'Los Angeles,Tampa Bay', 'Chicago,Texas',
'Atlanta,Los Angeles', 'Baltimore,Philadelphia',
'San Francisco,Texas', 'Boston,Toronto', 'Detroit,Texas',
'Atlanta,Cleveland', 'Houston,Toronto', 'Baltimore,San Francisco',
'Atlanta,Philadelphia', 'Chicago,Tampa Bay', 'Chicago,New York',
'Chicago,St. Louis', 'Chicago,Los Angeles', 'Tampa Bay,Washington',
'Seattle,Washington', 'Atlanta,Washington', 'Chicago,Washington',
'Atlanta,Boston,New York', 'Cincinnati,San Francisco',
'Atlanta,San Francisco', 'Baltimore,Miami', 'Cleveland,Washington',
'Los Angeles,New York', 'Seattle,Toronto', 'Atlanta,Miami',
'New York,Washington'], dtype=object)
# Determining how many values in the 'team' column have multiple teams
count_with_comma = batting_data_df[batting_data_df['team'].str.contains(',')].shape[0]
# Count the total number of players
total_players = batting_data_df.shape[0]
# Calculate the proportion
proportion_with_comma = (count_with_comma / total_players) * 100
print(f"Proportion of players with comma in team name (traded players): {proportion_with_comma:.2f}%")
Proportion of players with comma in team name (traded players): 7.47%
# Get value counts of all teams
batting_data_df['team'].value_counts()
Chicago 39
Los Angeles 34
New York 33
Cincinnati 25
Seattle 21
..
Chicago,Texas 1
Atlanta,Los Angeles 1
San Francisco,Texas 1
Boston,Toronto 1
New York,Washington 1
Name: team, Length: 68, dtype: int64
As stated earlier, the data contains players who belong to multiple teams. These are likely players who were traded mid-season. Because traded players make up a small proportion of the data, they will simply be omitted.
# Removing data where the team has a comma.
batting_data_df = batting_data_df[~batting_data_df['team'].str.contains(',', regex=False)]
batting_data_df['team'].unique()
array(['Washington', 'Houston', 'Boston', 'Atlanta', 'Milwaukee',
'Los Angeles', 'Arizona', 'Kansas City', 'Oakland', 'New York',
'Colorado', 'Chicago', 'Miami', 'Tampa Bay', 'St. Louis',
'Cleveland', 'San Diego', 'Detroit', 'Pittsburgh', 'San Francisco',
'Baltimore', 'Toronto', 'Cincinnati', 'Seattle', 'Philadelphia',
'Minnesota', 'Texas'], dtype=object)
It is also clear that there is no differentiation between teams in the same city. To fix this, the league column is checked to determine which team the player actually plays for. For example, if a player's team is New York but they play in the American League, the team name is set to New York Yankees instead of just New York.
# Correcting team names for cities with multiple teams
conditions = [
(batting_data_df['team'] == 'New York') & (batting_data_df['league'] == 'Maj-AL'),
(batting_data_df['team'] == 'New York') & (batting_data_df['league'] == 'Maj-NL'),
(batting_data_df['team'] == 'Los Angeles') & (batting_data_df['league'] == 'Maj-NL'),
(batting_data_df['team'] == 'Los Angeles') & (batting_data_df['league'] == 'Maj-AL'),
(batting_data_df['team'] == 'Chicago') & (batting_data_df['league'] == 'Maj-NL'),
(batting_data_df['team'] == 'Chicago') & (batting_data_df['league'] == 'Maj-AL')
]
choices = ['New York Yankees', 'New York Mets', 'Los Angeles Dodgers', 'Los Angeles Angels', 'Chicago Cubs', 'Chicago White Sox']
batting_data_df['team'] = np.select(conditions, choices, default=batting_data_df['team'])
batting_data_df.head()
| | age | league | team | games | plate_appearances | at_bats | runs | hits | double | triple | ... | intentional_walk | strikeouts | hit_by_pitch | sacrifice_hit | sacrifice_fly | double_play | batting_average | on_base_percentage | slugging_percentage | on_base_plus_slugging |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 23 | Maj-NL | Washington | 100 | 446 | 398 | 65 | 103 | 23 | 6 | ... | 3 | 88 | 12 | 1 | 2 | 2 | 0.259 | 0.333 | 0.460 | 0.792 |
| 2 | 37 | Maj-AL | Houston | 33 | 112 | 106 | 9 | 14 | 2 | 0 | ... | 0 | 26 | 2 | 0 | 1 | 4 | 0.132 | 0.170 | 0.208 | 0.377 |
| 3 | 25 | Maj-AL | Boston | 82 | 281 | 251 | 42 | 67 | 24 | 2 | ... | 0 | 77 | 1 | 0 | 3 | 4 | 0.267 | 0.335 | 0.486 | 0.821 |
| 4 | 26 | Maj-NL | Atlanta | 48 | 217 | 188 | 37 | 46 | 8 | 1 | ... | 0 | 52 | 3 | 0 | 0 | 4 | 0.245 | 0.346 | 0.362 | 0.707 |
| 5 | 28 | Maj-NL | Milwaukee | 107 | 464 | 408 | 57 | 102 | 25 | 0 | ... | 2 | 110 | 1 | 0 | 3 | 11 | 0.250 | 0.334 | 0.436 | 0.770 |
5 rows × 23 columns
A threshold for minimum plate appearances is implemented to filter out players with limited or insufficient data.
# Setting a threshold for minimum plate appearance
batting_data_df = batting_data_df[batting_data_df['plate_appearances'] >= 150]
Exploratory Data Analysis
Graphs depicting the correlations with runs batted in will be generated, exploring associations with various factors such as age, team affiliation, and total hits.
# Scatter plot: RBI vs Hits
plt.figure(figsize=(10, 6))
sns.regplot(data=batting_data_df, x='hits', y='runs_batted_in')
plt.xlabel('Hits')
plt.ylabel('RBIs')
plt.title('Scatter Plot: RBIs vs Hits')
plt.show()
# Box plot: RBIs by Age
plt.figure(figsize=(10, 6))
sns.boxplot(data=batting_data_df, x='age', y='runs_batted_in')
plt.xlabel('Age')
plt.ylabel('RBIs')
plt.title('Box Plot: RBIs by Age')
plt.show()
# Correlation heatmap
correlation_matrix = batting_data_df.corr(numeric_only = True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Histogram: Distribution of RBIs
plt.figure(figsize=(10, 6))
sns.histplot(data=batting_data_df, x='runs_batted_in', bins=20)
plt.xlabel('RBIs')
plt.ylabel('Frequency')
plt.title('Distribution of RBIs')
plt.show()
# Bar plot: RBIs by Team
plt.figure(figsize=(12, 6))
sns.barplot(data=batting_data_df, x='team', y='runs_batted_in')
plt.xlabel('Team')
plt.ylabel('Average RBIs')
plt.title('Bar Plot: Average RBIs by Team')
plt.xticks(rotation=90)
plt.show()
The exploration of the relationship between Runs Batted In (RBIs) and Hits reveals a notable trend, suggesting a positive correlation between these two variables. As the number of hits increases, players tend to achieve higher RBIs, indicating a connection between their ability to connect with the ball and their effectiveness in driving runs. However, while this positive correlation is apparent, it is essential to consider the distribution of RBIs. A slight right skewness is observed in the distribution, implying that there are more players with fewer RBIs and relatively fewer players with higher RBIs. This skewness may potentially be mitigated through the removal of outliers, which could influence the distribution's shape.
Further analysis of the relationship between age and RBIs indicates a lack of a clear trend between the two variables. Age does not seem to be a significant predictor of RBIs, suggesting that a player's effectiveness in driving runs is not solely influenced by their age. This observation underscores the complex interplay of various factors contributing to a player's performance in this aspect of the game.
Being on different teams appears to impact a player's total RBI stats. To account for this, the team variable will either be dummy-encoded, or another dataframe with team stats will be merged for feature engineering.
When examining the heatmap, one notable finding is the dependence of a player's Runs Batted In (RBIs) on the number of games played and their plate appearances. Additionally, there is a strong link between RBIs and strikeouts. This may seem counterintuitive, but it is not as complex as it appears. When a player has more plate appearances, they tend to accumulate more RBIs and more strikeouts, making these two numbers appear connected.
However, considering whether a high number of strikeouts indicates productivity is crucial. Normalization can address this by dividing each statistic by plate appearances, creating a more balanced perspective. This method allows teams to assess a player's expected performance per at-bat, providing a clearer understanding of their effectiveness.
Correlation Insights: Cumulative Stats and RBIs
Upon inspecting the heatmap of correlations, a notable pattern emerges between RBIs and cumulative statistics, including strikeouts. This relationship stems from the inherent connection between cumulative stats and the number of opportunities for RBIs. Players with more at-bats tend to accumulate more RBIs, a logical outcome considering the increased chances to drive in runs.
However, the presence of high RBIs correlated with higher strikeouts also underscores the nuanced nature of this relationship. While accumulating RBIs is a positive indicator of offensive contributions, an increased frequency of strikeouts may not be as desirable. Striking out is often associated with missed scoring opportunities, highlighting the balance between power hitting and plate discipline.
To ensure a fair assessment of player performance, it becomes imperative to normalize these statistics per at-bat. This approach levels the playing field, allowing us to evaluate how efficiently a player generates RBIs relative to their plate appearances. By normalizing per at-bat, we mitigate the influence of sheer volume and gain a more accurate perspective on a player's RBI efficiency, unearthing the true impact of their offensive prowess.
# Inspecting the columns of the data frame
batting_data_df.columns
Index(['age', 'league', 'team', 'games', 'plate_appearances', 'at_bats',
'runs', 'hits', 'double', 'triple', 'home_run', 'runs_batted_in',
'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
'sacrifice_hit', 'sacrifice_fly', 'double_play', 'batting_average',
'on_base_percentage', 'slugging_percentage', 'on_base_plus_slugging'],
dtype='object')
# Normalizing variables
variables_to_normalize = ['hits', 'double', 'triple', 'home_run', 'runs_batted_in',
'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
'sacrifice_hit', 'sacrifice_fly', 'double_play']
# Iterate through the variables and add normalized columns, then drop the original columns
for var in variables_to_normalize:
normalized_col_name = f'{var}_per_pa'
batting_data_df[normalized_col_name] = batting_data_df[var] / batting_data_df['plate_appearances']
batting_data_df.drop(columns=[var], inplace=True)
# Scatter plot: RBI vs Hits
plt.figure(figsize=(10, 6))
sns.regplot(data=batting_data_df, x='hits_per_pa', y='runs_batted_in_per_pa')
plt.xlabel('Hits per Plate Appearance')
plt.ylabel('RBIs per Plate Appearance')
plt.title('Scatter Plot: RBIs vs Hits (Normalized)')
plt.show()
# Box plot: RBIs by Age
plt.figure(figsize=(10, 6))
sns.boxplot(data=batting_data_df, x='age', y='runs_batted_in_per_pa')
plt.xlabel('Age')
plt.ylabel('RBIs per Plate Appearance')
plt.title('Box Plot: RBIs by Age (Normalized)')
plt.ylim(-.001, 0.3) # Set y-axis limit to 0.3
plt.show()
# Correlation heatmap
correlation_matrix = batting_data_df.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Histogram: Distribution of RBIs
plt.figure(figsize=(10, 6))
sns.histplot(data=batting_data_df, x='runs_batted_in_per_pa', bins=20)
plt.xlabel('RBIs per Plate Appearance')
plt.ylabel('Frequency')
plt.title('Distribution of RBIs (Normalized)')
plt.show()
# Bar plot: RBIs by Team
plt.figure(figsize=(12, 6))
sns.barplot(data=batting_data_df, x='team', y='runs_batted_in_per_pa')
plt.xlabel('Team')
plt.ylabel('Average RBIs per Plate Appearance')
plt.title('Bar Plot: Average RBIs by Team (Normalized)')
plt.xticks(rotation=90)
plt.show()
A closer examination of the data, specifically after normalizing the variables by plate appearances, revealed some interesting insights. The connections between most variables and Runs Batted In (RBIs) became less pronounced. This initially surprising observation has a straightforward explanation.
Normalizing the variables by plate appearances means evaluating performance "per turn at bat." By doing so, the analysis considers a player's performance relative to their opportunities to hit. Before normalization, players with more plate appearances naturally accumulated more RBIs and other stats such as hits, walks, and strikeouts. This made these numbers appear more interconnected than they truly were.
With normalization, the analysis levels the playing field, allowing for a fair comparison of players based on their performance in each turn at bat. Players with fewer plate appearances but strong performance in each opportunity stand out more clearly. This method removes the skew caused by varying numbers of plate appearances and highlights each player's efficiency.
Interestingly, normalizing the data also led to a more normal distribution of RBIs per plate appearance. The distribution began to resemble a more balanced and symmetric shape, indicating that when plate appearances are accounted for, RBIs per plate appearance align more closely with a typical distribution.
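This reduction in skew can be checked numerically with pandas' `Series.skew()`. A minimal sketch on made-up numbers (the toy values below are purely illustrative, not from the dataset; in the notebook this would be run on `batting_data_df`'s raw and per-PA RBI columns):

```python
import pandas as pd

# Toy season totals: high-PA players accumulate far more raw RBIs
df = pd.DataFrame({
    'plate_appearances': [600, 550, 500, 450, 400, 350, 300, 250, 200, 150],
    'runs_batted_in':    [110,  95,  70,  55,  45,  30,  25,  18,  10,   5],
})
df['runs_batted_in_per_pa'] = df['runs_batted_in'] / df['plate_appearances']

# Compare skewness before and after normalizing by plate appearances
print(f"raw RBI skew:    {df['runs_batted_in'].skew():.3f}")
print(f"RBI per PA skew: {df['runs_batted_in_per_pa'].skew():.3f}")
```

A skew near zero indicates a roughly symmetric distribution; large positive values indicate a long right tail like the one seen in the raw RBI histogram.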
To prepare for building models, it's important to select the right features to include. A useful step in this process is to visualize how each feature relates to the target variable using regression plots. This approach helps to identify features with strong relationships to the target, which may be valuable for model development.
# Get the list of numeric column names
numeric_column_names = batting_data_df.select_dtypes(include='number').columns.tolist()
# Create subplots for each numeric column
num_rows = (len(numeric_column_names) + 2) // 3
fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows))
# Flatten the axes if there's only one row
if num_rows == 1:
axes = [axes]
# Create scatter plots with regression lines for numeric columns
for i, var in enumerate(numeric_column_names):
row = i // 3
col = i % 3
ax = axes[row][col]
sns.regplot(data=batting_data_df, x=var, y='runs_batted_in_per_pa', ax=ax, color='blue', scatter_kws={'alpha': 0.5})
ax.set_xlabel(var)
ax.set_ylabel('RBIs per Plate Appearance')
ax.set_title(f'{var} vs. RBIs')
# Adjust layout
plt.tight_layout()
plt.show()
The variables 'runs,' 'batting average,' 'slugging percentage,' 'on-base percentage,' 'doubles per plate appearance,' 'triples per plate appearance,' 'hits per plate appearance,' 'home runs per plate appearance,' and 'walks per plate appearance' have shown the strongest linear associations with 'RBIs per plate appearance' in the analysis. Given their significant relationships with the target variable, these variables have been selected for the regression model. Including these key variables will help capture the main factors that influence 'RBIs per plate appearance.' This careful selection ensures that the model will provide meaningful insights and accurate predictions about the relationship between these variables and the target variable.
Outliers for the target variable are removed prior to regression analysis. Outliers are extreme data points that can distort results and decrease model accuracy. Removing them helps the model focus on typical data points, leading to a more reliable analysis that accurately reflects the relationships between variables. The criterion used for identifying an outlier is lying more than 3 standard deviations from the mean.
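The 3-standard-deviation rule can be sketched as below. The data here is synthetic (one planted extreme value); in the notebook the same mask would be applied to `batting_data_df['runs_batted_in_per_pa']`:

```python
import numpy as np
import pandas as pd

# Synthetic target column: 19 typical values plus one planted outlier
rng = np.random.default_rng(0)
values = list(rng.uniform(0.09, 0.13, size=19)) + [0.95]
df = pd.DataFrame({'runs_batted_in_per_pa': values})

# Keep only rows within 3 standard deviations of the mean
mean = df['runs_batted_in_per_pa'].mean()
std = df['runs_batted_in_per_pa'].std()
mask = (df['runs_batted_in_per_pa'] - mean).abs() <= 3 * std
df_no_outliers = df[mask]

print(len(df), '->', len(df_no_outliers))  # 20 -> 19
```

Note that with very small samples a single extreme point inflates the standard deviation enough to hide itself, so this filter is most reliable on datasets of this project's size (hundreds of players).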
Preparing the Linear Regression Model
One of the assumptions of linear regression is a linear relationship between each predictor and the target. Because of this, team may be better included as a categorical variable rather than through merged team stats, as those don't appear to be linear. Variables that do appear to be linear are runs, batting average, slugging percentage, on-base percentage, doubles, home runs, hits, and walks. Unfortunately, the merged team-stat variables don't appear to be linear with RBIs per plate appearance, so they won't be included in the regression.
Linear regression is conducted below, and the first step is to normalize the data using the min-max scaler. After scaling, the coefficients of the predictor variables represent the change in the target variable (RBIs per plate appearance) for a one-unit change in the predictor while holding the other predictors constant. With all variables on the same scale, the coefficients can be compared directly to gauge the relative influence of each predictor on the target. Next, k-fold cross-validation is applied to obtain averaged R-squared and MSE values for a more reliable evaluation of the model.
Linear Regression Model
# Constructing first linear regression model
predictor_vars = ['runs', 'batting_average', 'slugging_percentage', 'on_base_percentage', 'double_per_pa', 'triple_per_pa', 'hits_per_pa', 'home_run_per_pa', 'walk_per_pa']
X = batting_data_df[predictor_vars]
scaler = MinMaxScaler()
X[predictor_vars] = scaler.fit_transform(X[predictor_vars])
y = batting_data_df[['runs_batted_in_per_pa']]
# Initialize k-fold cross-validation
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=25)
mse_scores = []
r2_scores = []
residuals_list = []
y_pred_list = []
for train_index, val_index in kf.split(X):
X_train, X_val = X.iloc[train_index], X.iloc[val_index]
y_train, y_val = y.iloc[train_index], y.iloc[val_index]
model = LinearRegression()
model.fit(X_train, y_train)
y_pred_val = model.predict(X_val)
# Calculate residuals for this fold
residuals = y_val - y_pred_val
residuals_list.append(residuals)
y_pred_list.append(y_pred_val)
mse = mean_squared_error(y_val, y_pred_val)
r2 = r2_score(y_val, y_pred_val)
mse_scores.append(mse)
r2_scores.append(r2)
avg_mse = np.mean(mse_scores)
avg_r2 = np.mean(r2_scores)
print(f"Average Mean Squared Error: {avg_mse}")
print(f"Average R-squared: {avg_r2}")
import statsmodels.api as sm
# Add a constant column to X matrix for intercept
X = sm.add_constant(X)
# Create a linear regression model
model = sm.OLS(y, X)
# Fit the model
results = model.fit()
# Print the summary
print(results.summary())
Average Mean Squared Error: 0.00031177284840925933
Average R-squared: 0.6167006702681447
OLS Regression Results
=================================================================================
Dep. Variable: runs_batted_in_per_pa R-squared: 0.668
Model: OLS Adj. R-squared: 0.659
Method: Least Squares F-statistic: 68.52
Date: Thu, 01 Aug 2024 Prob (F-statistic): 4.54e-68
Time: 23:12:05 Log-Likelihood: 840.03
No. Observations: 316 AIC: -1660.
Df Residuals: 306 BIC: -1622.
Df Model: 9
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 0.0702 0.013 5.206 0.000 0.044 0.097
runs -0.0161 0.007 -2.467 0.014 -0.029 -0.003
batting_average 0.5763 0.157 3.664 0.000 0.267 0.886
slugging_percentage -0.2393 0.218 -1.097 0.273 -0.668 0.190
on_base_percentage -0.1421 0.042 -3.411 0.001 -0.224 -0.060
double_per_pa 0.0664 0.038 1.755 0.080 -0.008 0.141
triple_per_pa 0.0304 0.027 1.113 0.267 -0.023 0.084
hits_per_pa -0.3794 0.087 -4.344 0.000 -0.551 -0.207
home_run_per_pa 0.2612 0.122 2.143 0.033 0.021 0.501
walk_per_pa -0.0096 0.019 -0.496 0.620 -0.048 0.029
==============================================================================
Omnibus: 3.424 Durbin-Watson: 2.107
Prob(Omnibus): 0.181 Jarque-Bera (JB): 3.503
Skew: 0.241 Prob(JB): 0.173
Kurtosis: 2.818 Cond. No. 476.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Evaluating the outcomes of the linear regression, we observe a cross-validated R-squared value of about 0.62. This indicates that approximately 62% of the variation in the target variable can be explained by the predictor variables in the model. Analyzing the coefficients, several insights emerge. Notably, while the constant holds significance, the coefficients for individual variables display varying degrees of influence.
Interestingly, the coefficient for 'hits_per_pa' emerges as the most negative. This might seem counterintuitive at first glance, but it can be attributed to the multicollinearity present in the model. The coefficient for 'hits_per_pa' is influenced by the strong positive coefficient for 'batting_average,' as a higher batting average often translates to more hits. This interdependence suggests that 'hits_per_pa' may not be offering distinct predictive power beyond what 'batting_average' provides.
This introduces a layer of complexity when interpreting the coefficients. Multicollinearity among variables, such as 'hits_per_pa' and 'batting_average,' can make it challenging to disentangle the individual contributions of each variable to the model. In such cases, the coefficients might not accurately reflect the true impact of each variable on the target.
Given this multicollinearity, consideration should be given to potentially removing certain variables from the model. Variables that are strongly correlated or redundant could be excluded to enhance the model's interpretability without compromising its predictive power. This step aligns with the goal of crafting a model that not only predicts well but also provides insights that can be readily understood and applied in the context of baseball analytics.
Linear Regression Assumptions: Ensuring Solid Foundations
Linearity: checked earlier; only variables with approximately linear relationships to the target were used. Normality: the Q-Q plot isn't perfectly linear, but the histogram of the residuals looks normally distributed, so the assumption of normality is reasonably met.
# Combine residuals from all folds
all_residuals = np.concatenate(residuals_list)
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot histogram of residuals
ax1.hist(all_residuals, bins=20, edgecolor='k')
ax1.set_xlabel('Residuals')
ax1.set_ylabel('Frequency')
ax1.set_title('Histogram of Residuals')
# Plot Q-Q plot of residuals
sm.qqplot(all_residuals, line='s', ax=ax2) # 's' for standard normal distribution line
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.set_title('Q-Q Plot of Residuals')
plt.tight_layout()
plt.show()
Homoscedasticity (equal variances): A plot of residuals versus fitted values was generated, revealing a scatterplot resembling a random cloud. In such a plot, if the residuals are scattered evenly around the horizontal line at zero, it indicates that the assumption of constant variance (homoscedasticity) is likely met.
# Create residuals vs. fitted values plot
all_y_pred = np.concatenate(y_pred_list)
plt.figure(figsize=(8, 6))
plt.scatter(all_y_pred, all_residuals, color='blue', alpha=0.7)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values Plot')
plt.show()
Multicollinearity: The presence of multicollinearity in the analysis is not surprising, especially within the context of baseball statistics. Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other. In baseball, many statistics are intricately linked, as different aspects of a player's performance often feed off one another. For example, a player with a high batting average is likely to have more hits, and consequently, more runs batted in (RBIs).
# Define the predictor variables you want to plot
predictor_vars = ['runs', 'batting_average', 'slugging_percentage', 'on_base_percentage', 'double_per_pa', 'triple_per_pa', 'hits_per_pa', 'home_run_per_pa', 'walk_per_pa']
# Create subplots for each numeric column
num_rows = (len(predictor_vars) + 2) // 3
fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows))
# Flatten the axes if there's only one row
if num_rows == 1:
axes = [axes]
# Create scatter plots with regression lines for predictor variables
for i, var in enumerate(predictor_vars):
row = i // 3
col = i % 3
ax = axes[row][col]
sns.regplot(data=batting_data_df, x=var, y='runs_batted_in_per_pa', ax=ax, color='blue', scatter_kws={'alpha': 0.5})
ax.set_xlabel(var)
ax.set_ylabel('RBIs per Plate Appearance')
ax.set_title(f'{var} vs. RBIs')
# Adjust layout
plt.tight_layout()
plt.show()
In baseball, it's common to encounter significant multicollinearity among variables. Many statistics are interrelated, creating challenges when interpreting coefficients in a regression model. When variables are strongly correlated, isolating their individual effects on the target variable becomes difficult. For example, the model's coefficients might show hits negatively correlated with RBIs, likely due to compensation by the high positive value for batting average.
To address this issue, some features will be strategically removed from the model to reduce multicollinearity. This approach aims to untangle the complex relationships between variables, enhancing the understanding of how each feature contributes to the target variable's variation. This step is crucial for producing a more transparent and accurate regression analysis, providing meaningful insights into the factors driving the outcomes.
Since batting average and walks overlap heavily with on-base percentage, and triples and doubles overlap heavily with slugging percentage, these will be consolidated into on-base percentage and slugging percentage. This consolidation will help reduce redundancy and improve the model's interpretability.
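The overlap motivating this consolidation can be checked directly with a correlation matrix. The sketch below uses synthetic stand-in data; the coefficients tying on-base percentage to batting average and walk rate are rough assumptions for illustration, not the official definition:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: on-base percentage is (roughly) built from
# batting average and walk rate, so strong pairwise correlations are expected
rng = np.random.default_rng(0)
n = 300
batting_average = rng.normal(0.26, 0.03, n)
walk_per_pa = rng.normal(0.08, 0.02, n)
on_base_percentage = 0.9 * batting_average + walk_per_pa + rng.normal(0, 0.005, n)

df = pd.DataFrame({
    'batting_average': batting_average,
    'walk_per_pa': walk_per_pa,
    'on_base_percentage': on_base_percentage,
})
# Strong off-diagonal values flag candidates for consolidation
print(df.corr().round(2))
```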
Combating multicollinearity: Linear Regression Model 2
# Constructing another Linear Regression Model with different features
predictor_vars_2 = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = batting_data_df[predictor_vars_2]
y = batting_data_df['runs_batted_in_per_pa']
scaler = MinMaxScaler()
X[predictor_vars_2] = scaler.fit_transform(X[predictor_vars_2])
# Add a constant column to X matrix for intercept
X = sm.add_constant(X)
# Perform k-fold cross-validation
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=1)
mse_scores = []
r2_scores = []
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    model = sm.OLS(y_train, X_train)
    results = model.fit()
    y_pred = results.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
# Calculate average MSE and R-squared across folds
avg_mse = np.mean(mse_scores)
avg_r2 = np.mean(r2_scores)
print(f"Average Mean Squared Error: {avg_mse}")
print(f"Average R-squared: {avg_r2}")
predictor_vars_2 = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = batting_data_df[predictor_vars_2]
y = batting_data_df['runs_batted_in_per_pa']
# Add a constant column to X matrix for intercept
X = sm.add_constant(X)
# Create a linear regression model
model = sm.OLS(y, X)
# Fit the model
results = model.fit()
# Print the summary
print(results.summary())
Average Mean Squared Error: 0.0003210269003309013
Average R-squared: 0.6183617358788298
OLS Regression Results
=================================================================================
Dep. Variable: runs_batted_in_per_pa R-squared: 0.637
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 136.7
Date: Thu, 01 Aug 2024 Prob (F-statistic): 3.04e-67
Time: 23:12:07 Log-Likelihood: 825.95
No. Observations: 316 AIC: -1642.
Df Residuals: 311 BIC: -1623.
Df Model: 4
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 0.0293 0.009 3.331 0.001 0.012 0.047
runs -0.0002 8.31e-05 -2.863 0.004 -0.000 -7.44e-05
on_base_percentage -0.1021 0.046 -2.228 0.027 -0.192 -0.012
slugging_percentage 0.2626 0.038 6.940 0.000 0.188 0.337
home_run_per_pa 0.6602 0.137 4.816 0.000 0.390 0.930
==============================================================================
Omnibus: 2.183 Durbin-Watson: 2.137
Prob(Omnibus): 0.336 Jarque-Bera (JB): 2.207
Skew: 0.200 Prob(JB): 0.332
Kurtosis: 2.913 Cond. No. 5.77e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.77e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Comparing the two sets of predictor variables reveals an interesting pattern. The second model, which includes fewer variables, exhibits very similar performance to the first one in explaining RBIs per plate appearance. This similarity suggests that these specific variables have a significant impact on predicting RBIs.
One of the standout findings is the importance of home runs in driving RBIs. The positive coefficient attached to home runs confirms their role in boosting RBI counts. This aligns with the common understanding that home runs often lead to multiple runs being scored, thus elevating a player's RBI tally.
An intriguing twist comes from the negative coefficient tied to on-base percentage. While initially puzzling, this can be explained by the different types of players. Those who emphasize on-base percentage—usually contact hitters—are less likely to hit home runs. Given that home runs seem to strongly contribute to RBIs, this leads to a lower RBI per plate appearance for contact hitters, potentially resulting in the observed negative coefficient for on-base percentage.
Looking at the model's performance, an R-squared value of 0.62 indicates that around 62% of the variation in RBIs per plate appearance can be explained by the variables used in the model. This suggests that while these variables provide significant insights into RBI production, other factors also play a role in the remaining variability.
The Mean Squared Error (MSE) of 0.00032, though seemingly small, needs context. Since the target variable, RBIs per plate appearance, is typically a small value below 1, squaring it for MSE calculation makes it even smaller. This highlights how the scale of the target variable influences the scale of the MSE value.
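One way to put the error back on the target's scale is the root mean squared error (RMSE), the square root of the MSE. The values below are made-up RBI-per-PA numbers chosen only to show the scale effect:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical RBI-per-PA values on their typical small scale
y_true = np.array([0.10, 0.12, 0.08, 0.15])
y_pred = np.array([0.11, 0.10, 0.09, 0.13])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the units of the target: RBIs per plate appearance
print(f"MSE = {mse:.5f}, RMSE = {rmse:.5f}")
```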
This model shows how different variables affect RBIs per plate appearance. The straightforward coefficients, the moderate R-squared value, and the scaled MSE offer a clear view of how these variables interact within the realm of baseball performance.
Random Forest Machine Learning Model: Random Forest is a machine learning algorithm that creates an ensemble of decision trees to make predictions. It aggregates the predictions of multiple trees to arrive at a more accurate and robust final prediction. In the context of baseball stats, Random Forest can be employed to predict outcomes like RBIs per plate appearance based on various player statistics. By harnessing the power of multiple decision trees and their collective insights, we can better capture the intricate relationships between different baseball metrics and enhance the understanding of factors influencing player performance.
# Split the data into training and testing sets for Random Forest Regressor model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42)
# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [25, 50, 100, 150, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10, 20]
}
# Initialize GridSearchCV with the Random Forest model and parameter grid
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)
# Get the best model and its hyperparameters
best_rf_model = grid_search.best_estimator_
# Make predictions on the test data using the best model
y_pred = best_rf_model.predict(X_test)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Best Model Parameters: {grid_search.best_params_}")
print(f"Mean Squared Error with Best Model: {mse}")
print(f"R-squared with Best Model: {r2}")
# Choose an arbitrary tree from the Random Forest (e.g., first tree)
chosen_tree = best_rf_model.estimators_[0]
plt.figure(figsize=(20, 10))
plot_tree(chosen_tree, feature_names=X_train.columns, filled=True, rounded=True)
plt.show()
feature_importances = best_rf_model.feature_importances_
sorted_indices = np.argsort(feature_importances)[::-1] # Sort indices in descending order
for idx in sorted_indices:
    print(f"Feature: {X_train.columns[idx]}, Importance: {feature_importances[idx]}")
Best Model Parameters: {'max_depth': 5, 'min_samples_split': 20, 'n_estimators': 200}
Mean Squared Error with Best Model: 0.00041698845635485514
R-squared with Best Model: 0.500431181932357
Feature: slugging_percentage, Importance: 0.5153807953312785
Feature: home_run_per_pa, Importance: 0.42584553349050674
Feature: on_base_percentage, Importance: 0.031954056660413
Feature: runs, Importance: 0.02681961451780184
Feature: const, Importance: 0.0
In exploring predictive modeling techniques, it was observed that the Random Forest model demonstrated worse R-squared and Mean Squared Error (MSE) metrics than the Linear Regression model. While this suggests that the Random Forest model has a worse overall fit to the data, it is important to note the distinction between the two evaluation approaches.
The R-squared and MSE values provide insights into model performance, but the context of model selection and hyperparameter tuning is crucial. For the Random Forest model, GridSearchCV was employed. This method exhaustively searches through various combinations of hyperparameters to identify the optimal configuration that minimizes the MSE. This process not only accounts for the individual folds in cross-validation but also selects the best-performing model based on the aggregate evaluation metrics.
In contrast, K-fold cross-validation applied to the Linear Regression model provides an average R-squared and MSE across multiple folds. This offers an effective assessment of the model's general performance but does not specifically target the best combination of hyperparameters as GridSearchCV does. Instead, it evaluates the model's consistency and overall predictive ability across different subsets of the data.
Thus, while the Random Forest model showed worse R-squared and MSE metrics, the strength of GridSearchCV lies in its ability to pinpoint the optimal model configuration based on these performance metrics. This distinction highlights the nuanced interplay between model selection, hyperparameter tuning, and evaluation approach within predictive modeling. It is therefore not accurate to claim that one model is definitively better than the other based solely on R-squared and MSE.
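One way to make such comparisons apples-to-apples is nested cross-validation, where the grid search runs inside each outer fold so the tuned model is scored exactly like a cross-validated baseline. This is a minimal sketch on synthetic data; the grid and the data generator are illustrative assumptions, not the project's:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data with a simple signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.3, size=200)

# Inner loop tunes hyperparameters; outer loop estimates generalization error
inner = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={'max_depth': [3, 5], 'n_estimators': [50, 100]},
    cv=3, scoring='neg_mean_squared_error',
)
outer_mse = -cross_val_score(inner, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Nested CV MSE: {outer_mse.mean():.4f}")
```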
Both the Random Forest and Linear Regression models agree on the significance of home runs per plate appearance as a pivotal predictor for RBIs per plate appearance. Interestingly, the prominence of on-base percentage is unexpectedly low compared to metrics such as slugging and home runs. This consistency in model findings underscores the critical role of home runs in influencing RBI outcomes, while the comparatively modest importance assigned to on-base percentage warrants further investigation.
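A caveat worth noting when investigating this further: the impurity-based scores from feature_importances_ can be distorted when predictors are correlated. Permutation importance is one more robust alternative, sketched below on synthetic data where the three features and their effect sizes are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: feature 0 matters most, feature 2 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Permutation importance: the drop in held-out score when each column is shuffled
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```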
K-Nearest Neighbors Machine Learning Model: K-Nearest Neighbors (KNN) is a machine learning algorithm that predicts outcomes by finding the "k" closest data points to a given observation and using their values to make a prediction. In the realm of baseball statistics, KNN can be applied to predict metrics like RBIs per plate appearance. By identifying similar players based on their statistical profiles, KNN infers a player's performance from that of their nearest neighbors. This method leverages the idea that players with similar performance characteristics tend to have similar outcomes, offering insights into how various metrics relate to RBIs within the context of baseball.
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import numpy as np
# Create a StandardScaler instance to scale your data
scaler = StandardScaler()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Scale the feature matrices
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Define a range of k values to test
k_values = range(1, 21) # You can adjust this range as needed
# Initialize lists to store the average MSE for each k
mse_values = []
# Specify the number of folds for cross-validation
num_folds = 5 # Adjust as needed
# Iterate over different values of k
for k in k_values:
    # Initialize a K-nearest neighbors regressor with the current k value
    knn_regressor = KNeighborsRegressor(n_neighbors=k)
    # Perform K-fold cross-validation and calculate MSE scores
    mse_scores = -cross_val_score(knn_regressor, X_train_scaled, y_train, cv=num_folds, scoring='neg_mean_squared_error')
    # Calculate the average MSE across all folds for the current k
    avg_mse = np.mean(mse_scores)
    mse_values.append(avg_mse)
# Plot the MSE values for different values of k
plt.figure(figsize=(10, 6))
plt.plot(k_values, mse_values, marker='o', linestyle='-', color='b')
plt.title('Mean Squared Error vs. Number of Neighbors (k)')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Mean Squared Error (MSE)')
plt.xticks(k_values) # Use the specified k values on the x-axis
plt.grid(True)
plt.show()
# Create a StandardScaler instance to scale your data
scaler = StandardScaler()
# Scale the feature matrices
X_scaled = scaler.fit_transform(X)
# Define the value of k you want to evaluate
k = 3
# Initialize a K-nearest neighbors regressor with k=3
knn_regressor = KNeighborsRegressor(n_neighbors=k)
# Perform K-fold cross-validation and get predicted values
y_pred = cross_val_predict(knn_regressor, X_scaled, y, cv=num_folds)
# Calculate MSE and R-squared for k=3
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Statistics for k = {k}:")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Statistics for k = 3:
Mean Squared Error: 0.00046618166695360533
R-squared: 0.4621318270034519
Adding More Data: The models above didn't quite show the results that we wanted. To improve them, batting data for the past six years was downloaded to provide a larger dataset. The original rationale for using only 2023 data was that the game of baseball is always changing, so it may be best to build the models on recent data. However, a single season provides very limited data, so data was fetched for the last six years instead.
# Initialize a list to store the data for each year (except the current year)
batting_data = []
# Loop through the last 6 years (from 2018 to 2023)
for year in range(2018, 2024):
    # Read in the CSV file for the current year
    file_path = f"batting/{year}_batting.csv"
    df = pd.read_csv(file_path)
    # Append the data to the list
    batting_data.append(df)
# Concatenate the dataframes for each year
combined_data = pd.concat(batting_data)
# Specify the columns you want to aggregate
columns_to_aggregate = ['G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB']
Some columns, such as on-base percentage and slugging percentage, are rate statistics that can't simply be summed across seasons. Luckily, the counting statistics needed to recalculate them manually are available.
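Why rate statistics can't simply be summed or averaged can be seen with two hypothetical seasons for one player. The counts below are made up, and the simplified OBP denominator of plate appearances mirrors the calculation used in this notebook:

```python
import pandas as pd

# Two hypothetical seasons: averaging the per-season OBPs weights them equally,
# while recomputing from summed counts weights by playing time
seasons = pd.DataFrame({
    'PA':  [100, 600],
    'H':   [30, 150],
    'BB':  [10, 50],
    'HBP': [0, 10],
})
seasons['OBP'] = (seasons['H'] + seasons['BB'] + seasons['HBP']) / seasons['PA']

naive_mean = seasons['OBP'].mean()
recomputed = (seasons['H'] + seasons['BB'] + seasons['HBP']).sum() / seasons['PA'].sum()
print(naive_mean, recomputed)  # the two disagree whenever playing time differs
```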
# Group by 'Name' and aggregate the specified columns
agg_data = combined_data.groupby('Name')[columns_to_aggregate].sum().reset_index()
# Filter for players with over 1000 plate appearances
agg_data_filtered = agg_data[agg_data['PA'] > 1000]
# Calculate 'OBP' (On-Base Percentage) and 'SLG' (Slugging Percentage) using .loc
# Note: PA is used as the OBP denominator here, a simplification of the official
# formula (H + BB + HBP) / (AB + BB + HBP + SF)
agg_data_filtered.loc[:, 'OBP'] = (agg_data_filtered['H'] + agg_data_filtered['BB'] + agg_data_filtered['HBP']) / agg_data_filtered['PA']
agg_data_filtered.loc[:, 'SLG'] = agg_data_filtered['TB'] / agg_data_filtered['AB']
# Calculate 'OPS' (On-base Plus Slugging)
agg_data_filtered.loc[:, 'OPS'] = agg_data_filtered['OBP'] + agg_data_filtered['SLG']
# Print or further analyze the filtered aggregated data with fixed metrics
print(agg_data_filtered)
Name G PA AB R H 2B 3B HR RBI ... \
3 AJ Pollock 617 2249 2056 284 520 106 8 97 301 ...
13 Aaron Hicks# 578 2180 1844 310 439 66 9 73 258 ...
14 Aaron Judge 642 2799 2342 467 664 107 1 196 436 ...
24 Abraham Toro# 366 1309 1177 156 258 42 3 39 154 ...
26 Adalberto Mondesí# 286 1157 1085 160 277 52 17 35 141 ...
... ... ... ... ... ... ... ... .. ... ... ...
2282 Yoshi Tsutsugo* 318 1077 939 114 197 41 4 34 134 ...
2285 Yoán Moncada# 664 2809 2506 332 640 142 17 82 312 ...
2288 Yuli Gurriel 723 2919 2671 349 745 172 7 77 371 ...
2302 Zach McKinstry* 315 1035 932 117 202 41 10 25 89 ...
2322 Óscar Mercado 407 1228 1131 157 261 58 7 34 132 ...
SO TB GDP HBP SH SF IBB OBP SLG OPS
3 461 933 38 23 1 23 8 0.306358 0.453794 0.760152
13 486 742 24 10 3 15 7 0.347248 0.402386 0.749634
14 775 1361 61 18 0 19 35 0.391568 0.581127 0.972696
24 220 423 17 25 0 9 1 0.291062 0.359388 0.650450
26 342 468 16 3 11 7 0 0.286085 0.431336 0.717421
... ... ... ... ... .. .. ... ... ... ...
2282 289 348 16 4 2 8 1 0.301764 0.370607 0.672371
2285 808 1062 22 20 4 11 6 0.330367 0.423783 0.754150
2288 340 1162 80 24 0 37 7 0.327509 0.435043 0.762552
2302 264 338 7 6 7 5 1 0.283092 0.362661 0.645753
2322 243 435 34 10 9 8 0 0.277687 0.384615 0.662303
[399 rows x 23 columns]
The columns were renamed to match the other dataframe.
# Creating a dictionary to map the variables.
column_name_mapping = {
    'Age': 'age',
    'Lev': 'league',
    'Tm': 'team',
    'G': 'games',
    'PA': 'plate_appearances',
    'AB': 'at_bats',
    'R': 'runs',
    'H': 'hits',
    '2B': 'double',
    '3B': 'triple',
    'HR': 'home_run',
    'RBI': 'runs_batted_in',
    'BB': 'walk',
    'IBB': 'intentional_walk',
    'SO': 'strikeouts',
    'HBP': 'hit_by_pitch',
    'SH': 'sacrifice_hit',
    'SF': 'sacrifice_fly',
    'GDP': 'double_play',
    'BA': 'batting_average',
    'OBP': 'on_base_percentage',
    'SLG': 'slugging_percentage',
    'OPS': 'on_base_plus_slugging'
}
agg_data_filtered.rename(columns=column_name_mapping, inplace=True)
agg_data_filtered.head()
| Name | games | plate_appearances | at_bats | runs | hits | double | triple | home_run | runs_batted_in | ... | strikeouts | TB | double_play | hit_by_pitch | sacrifice_hit | sacrifice_fly | intentional_walk | on_base_percentage | slugging_percentage | on_base_plus_slugging | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | AJ Pollock | 617 | 2249 | 2056 | 284 | 520 | 106 | 8 | 97 | 301 | ... | 461 | 933 | 38 | 23 | 1 | 23 | 8 | 0.306358 | 0.453794 | 0.760152 |
| 13 | Aaron Hicks# | 578 | 2180 | 1844 | 310 | 439 | 66 | 9 | 73 | 258 | ... | 486 | 742 | 24 | 10 | 3 | 15 | 7 | 0.347248 | 0.402386 | 0.749634 |
| 14 | Aaron Judge | 642 | 2799 | 2342 | 467 | 664 | 107 | 1 | 196 | 436 | ... | 775 | 1361 | 61 | 18 | 0 | 19 | 35 | 0.391568 | 0.581127 | 0.972696 |
| 24 | Abraham Toro# | 366 | 1309 | 1177 | 156 | 258 | 42 | 3 | 39 | 154 | ... | 220 | 423 | 17 | 25 | 0 | 9 | 1 | 0.291062 | 0.359388 | 0.650450 |
| 26 | Adalberto Mondesí# | 286 | 1157 | 1085 | 160 | 277 | 52 | 17 | 35 | 141 | ... | 342 | 468 | 16 | 3 | 11 | 7 | 0 | 0.286085 | 0.431336 | 0.717421 |
5 rows × 23 columns
# Create a scatter plot of hits vs. runs_batted_in
plt.figure(figsize=(10, 5))
plt.scatter(agg_data_filtered['hits'], agg_data_filtered['runs_batted_in'], alpha=0.5)
plt.title('Hits vs. Runs Batted In')
plt.xlabel('Hits')
plt.ylabel('Runs Batted In')
plt.grid(True)
plt.show()
# Create a scatter plot of runs_batted_in vs. home_run
plt.figure(figsize=(10, 5))
plt.scatter(agg_data_filtered['home_run'], agg_data_filtered['runs_batted_in'], alpha=0.5)
plt.title('Runs Batted In vs. Home Runs')
plt.xlabel('Home Runs')
plt.ylabel('Runs Batted In')
plt.grid(True)
plt.show()
# Create a scatter plot of on_base_plus_slugging (OPS) vs. runs_batted_in
plt.figure(figsize=(10, 5))
plt.scatter(agg_data_filtered['on_base_plus_slugging'], agg_data_filtered['runs_batted_in'], alpha=0.5)
plt.title('On-Base Plus Slugging (OPS) vs. Runs Batted In')
plt.xlabel('OPS')
plt.ylabel('Runs Batted In')
plt.grid(True)
plt.show()
Analyzing the New Scatter Plots: The new scatter plots, based on a larger dataset spanning six seasons (2018-2023), reveal trends that provide valuable insights into the relationships between key baseball statistics and offer a more comprehensive view of player performance.
The relationships between the variables now show enhanced linearity, largely due to the increased number of data points from multiple seasons. This broader spectrum of player performance makes the improvement in linearity evident when compared to plots based on a single season's data.
The scatter plot comparing On-Base Plus Slugging (OPS) to Runs Batted In (RBI) reveals a moderately positive correlation. This indicates that as a player's OPS increases, their RBI count tends to rise as well. This highlights the importance of a player's ability to get on base and produce extra-base hits, which significantly contribute to RBI totals.
The next step involves normalizing these statistics per plate appearance to provide a more precise comparison. This normalization ensures a level playing field for all players, regardless of the number of plate appearances. By doing so, the analysis will yield more accurate conclusions about player performance and its correlation with key metrics.
By continually refining the analysis and leveraging a multi-season dataset, a more robust model for evaluating player performance in baseball can be developed.
# Normalizing variables
agg_data_filtered
variables_to_normalize = ['hits', 'double', 'triple', 'home_run', 'runs_batted_in',
                          'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
                          'sacrifice_hit', 'sacrifice_fly', 'double_play']
# Iterate through the variables and add normalized columns, then drop the original columns
for var in variables_to_normalize:
    normalized_col_name = f'{var}_per_pa'
    agg_data_filtered[normalized_col_name] = agg_data_filtered[var] / agg_data_filtered['plate_appearances']
    agg_data_filtered.drop(columns=[var], inplace=True)
# Define the x-variables to plot
x_variables = ['hits_per_pa', 'home_run_per_pa', 'on_base_plus_slugging']
# Create scatter plots for RBI per PA vs. the selected x-variables
for x_var in x_variables:
    plt.figure(figsize=(10, 5))
    plt.scatter(agg_data_filtered[x_var], agg_data_filtered['runs_batted_in_per_pa'], alpha=0.5)
    plt.title(f'RBI per PA vs. {x_var.capitalize()}')
    plt.xlabel(x_var.capitalize())
    plt.ylabel('RBI per PA')
    plt.grid(True)
    plt.show()
Analyzing the Relationships
The in-depth analysis of the relationship between Runs Batted In per Plate Appearance (RBI per PA) and key baseball statistics—Hits, Home Runs, and On-Base Plus Slugging (OPS)—has provided valuable insights into the factors influencing a player's offensive performance.
Home Runs per PA and RBI per PA: A standout finding is the strong connection between Home Runs per PA and RBI per PA. This relationship highlights the critical role of power hitting, particularly home runs, in driving in runs. The correlation suggests that players who excel at hitting home runs tend to accumulate RBIs at a higher rate. This observation aligns with previous machine learning and regression models, which consistently identified home runs as a key predictor of RBI production. The robustness of this relationship, even with a larger dataset, underscores the significant impact of home runs on run production.
Hits per PA and RBI per PA: Conversely, the relationship between Hits per PA and RBI per PA is less pronounced. While hits are essential for offensive success, this analysis indicates that their impact on RBI production may not be as substantial as power hitting. This suggests that accumulating hits alone does not necessarily lead to a higher RBI count.
OPS and RBI per PA: For OPS, which combines a player's on-base ability and slugging power, there is a balanced yet moderate connection with RBI production. This equilibrium highlights the importance of both getting on base and hitting for power. Previous machine learning and regression models also pointed to OPS as a valuable predictor of RBIs, and this observation with a larger dataset reaffirms its relevance.
Summary: The findings from this analysis resonate with insights from previous machine learning and regression models. The consistent relationship between home runs and RBIs, even with more data, emphasizes the pivotal role of power hitting in driving in runs. These insights deepen the understanding of player performance dynamics in baseball, highlighting the factors that significantly contribute to success on the field.
Linear Regression Model
# Define predictor variables and the dependent variable
predictor_vars = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = agg_data_filtered[predictor_vars]
y = agg_data_filtered['runs_batted_in_per_pa']
# Scale the predictor variables
scaler = MinMaxScaler()
X[predictor_vars] = scaler.fit_transform(X[predictor_vars])
# Add a constant column to X matrix for intercept
X = sm.add_constant(X)
# Perform k-fold cross-validation
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=1)
mse_scores = []
r2_scores = []
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    model = sm.OLS(y_train, X_train)
    results = model.fit()
    y_pred = results.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
# Calculate average MSE and R-squared across folds
avg_mse = np.mean(mse_scores)
avg_r2 = np.mean(r2_scores)
print(f"Average Mean Squared Error: {avg_mse}")
print(f"Average R-squared: {avg_r2}")
# Create a linear regression model
model = sm.OLS(y, X)
# Fit the model
results = model.fit()
# Print the summary
print(results.summary())
Average Mean Squared Error: 0.00012198247368073569
Average R-squared: 0.7535595648443909
OLS Regression Results
=================================================================================
Dep. Variable: runs_batted_in_per_pa R-squared: 0.769
Model: OLS Adj. R-squared: 0.766
Method: Least Squares F-statistic: 327.3
Date: Thu, 01 Aug 2024 Prob (F-statistic): 8.66e-124
Time: 23:12:48 Log-Likelihood: 1238.7
No. Observations: 399 AIC: -2467.
Df Residuals: 394 BIC: -2448.
Df Model: 4
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 0.0700 0.002 33.478 0.000 0.066 0.074
runs -0.0103 0.003 -3.208 0.001 -0.017 -0.004
on_base_percentage -0.0124 0.006 -2.192 0.029 -0.024 -0.001
slugging_percentage 0.0712 0.009 7.708 0.000 0.053 0.089
home_run_per_pa 0.0588 0.006 9.848 0.000 0.047 0.071
==============================================================================
Omnibus: 0.233 Durbin-Watson: 2.090
Prob(Omnibus): 0.890 Jarque-Bera (JB): 0.108
Skew: 0.024 Prob(JB): 0.947
Kurtosis: 3.065 Cond. No. 27.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Leveraging More Data for Improved Predictions
The regression model's performance significantly improved when applied to the six-season dataset. The Mean Squared Error (MSE) decreased from 0.00032 to 0.00012, and the R-squared value increased from 0.62 to 0.75.
This improvement highlights the value of a larger dataset. With more data points, the model gains a deeper understanding of the relationships between predictor variables like runs, slugging percentage, and home runs per Plate Appearance (PA) and the dependent variable, runs batted in per PA.
More data enables the model to capture a broader range of player performances and reduces the risk of overfitting. In this context, it reaffirms the significance of power hitting, particularly home runs and slugging, over simply getting on base.
The regression results with the larger dataset provide quantitative evidence supporting the importance of power hitting in driving in runs. Notably, the coefficient associated with 'slugging_percentage' stands out, indicating a strong link between slugging and runs batted in. Additionally, the coefficient for 'home_run_per_pa' underscores the pivotal role of home runs in run production.
Random Forest Model
predictor_vars = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = agg_data_filtered[predictor_vars]
y = agg_data_filtered['runs_batted_in_per_pa']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42)
# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [25, 50, 100, 150],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10, 20]
}
# Initialize GridSearchCV with the Random Forest model and parameter grid
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)
# Get the best model and its hyperparameters
best_rf_model = grid_search.best_estimator_
# Make predictions on the test data using the best model
y_pred = best_rf_model.predict(X_test)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Best Model Parameters: {grid_search.best_params_}")
print(f"Mean Squared Error with Best Model: {mse}")
print(f"R-squared with Best Model: {r2}")
# Choose an arbitrary tree from the Random Forest (e.g., first tree)
chosen_tree = best_rf_model.estimators_[0]
plt.figure(figsize=(20, 10))
plot_tree(chosen_tree, feature_names=X_train.columns, filled=True, rounded=True)
plt.show()
feature_importances = best_rf_model.feature_importances_
sorted_indices = np.argsort(feature_importances)[::-1] # Sort indices in descending order
for idx in sorted_indices:
    print(f"Feature: {X_train.columns[idx]}, Importance: {feature_importances[idx]}")
Best Model Parameters: {'max_depth': None, 'min_samples_split': 20, 'n_estimators': 100}
Mean Squared Error with Best Model: 0.00012422084331287662
R-squared with Best Model: 0.7395001595887596
The Random Forest Regressor model also exhibited notable improvement when applied to the larger six-season dataset. In the previous analysis using a one-year dataset, the Mean Squared Error (MSE) was 0.00042 and the R-squared value stood at 0.50, with an importance score for home runs of about 0.43.
With the additional data, the MSE fell to 0.00012 and the R-squared value rose to 0.74. Interestingly, the importance score for home runs surged to 0.81, highlighting its dominant role in predicting runs batted in, while the importance of slugging decreased to 0.14.
These findings further emphasize the importance of home runs in run production. The RandomForestRegressor model's enhanced performance with more data corroborates that power hitting, specifically through home runs, remains a pivotal factor in a player's ability to contribute significantly to offensive success in baseball.
Evaluating Model Performance
Random Forest with 6-Season Data: the strongest performer (MSE ≈ 0.00012, R-squared ≈ 0.74)
Linear Regression with 6-Season Data: comparable predictive accuracy on the larger dataset
KNN with 1-Season Data: the highest MSE of the models tested
Random Forest with 1-Season Data: reasonable but less accurate (MSE ≈ 0.0004, R-squared ≈ 0.5)
Linear Regression Model 2 with 1-Season Data: reasonable but less accurate than the 6-season models
Linear Regression Model 1 with 1-Season Data: reasonable but less accurate than the 6-season models
In summary, the evaluation of model performance reveals that both the Random Forest and Linear Regression models trained on 6-season data stand out as strong performers, with excellent predictive accuracy and the ability to explain a substantial portion of the variance in the data. With 1-season data, performance drops: KNN exhibits a higher MSE, while the Random Forest and Linear Regression models provide reasonable but less accurate predictions than their 6-season counterparts. The choice of the most suitable model should consider the specific context and requirements of the analysis, along with further exploration and validation.
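A side-by-side comparison of this kind can be reproduced by scoring every candidate model on the same K-fold splits. The sketch below uses synthetic stand-in features rather than the real batting data, so the numbers it prints are illustrative only:

```python
# Comparing the three model families on identical K-fold splits so that
# MSE and R-squared are directly comparable. X and y are synthetic
# stand-ins for the per-plate-appearance batting features and RBI target.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))  # stand-ins for HR rate, SLG, BB rate, AVG
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * rng.standard_normal(200)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # Shared splits keep the comparison fair across models
    mse = -cross_val_score(model, X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    results[name] = (mse, r2)
    print(f"{name}: MSE={mse:.5f}, R^2={r2:.3f}")
```

Scoring all models against one `KFold` object is what makes the reported metrics comparable; with independent random splits, differences between models could just reflect split luck.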
Exploratory Data Analysis (EDA)
To understand the relationships between player statistics and Runs Batted In (RBIs) per plate appearance, a comprehensive Exploratory Data Analysis (EDA) was conducted. By employing visualizations and data exploration techniques, the analysis aimed to uncover insights that would guide subsequent modeling efforts.
Positive Correlations: The EDA revealed consistent positive correlations between various variables and RBIs per plate appearance. These correlations provided valuable initial insights into how player statistics relate to RBIs, acting as signals for predictive modeling.
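The correlation step of the EDA can be sketched as follows; the column names and data here are synthetic stand-ins for the fields pulled from pybaseball:

```python
# Computing pairwise correlations against RBIs per plate appearance and
# visualizing them as a heatmap. The DataFrame is synthetic: RBI_per_PA is
# constructed to correlate with HR_per_PA and SLG, mimicking the EDA's findings.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    "HR_per_PA": rng.random(n),
    "SLG": rng.random(n),
    "OBP": rng.random(n),
})
df["RBI_per_PA"] = (0.5 * df["HR_per_PA"] + 0.3 * df["SLG"]
                    + 0.05 * rng.standard_normal(n))

# Rank the predictors by their correlation with the target
corr = df.corr()
print(corr["RBI_per_PA"].sort_values(ascending=False))

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation with RBIs per plate appearance")
plt.tight_layout()
plt.show()
```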
Machine Learning and Regression Models
Dominant Role of Home Runs: The machine learning and regression models highlighted the pivotal role of home runs in predicting RBIs. Both Random Forest and Linear Regression models underscored the significance of home runs in a player's ability to drive in runs, reaffirming the conventional wisdom that "home runs drive the offense."
Complexity of On-Base Percentage (OBP): While metrics like slugging and home runs were significant, on-base percentage emerged as less influential. This may be due to the diverse offensive strategies employed by players. Contact hitters who emphasize on-base percentage might have different RBI profiles compared to power hitters.
Future Directions and Considerations
The project sets the stage for further exploration and offers several avenues for future research. One promising direction is the incorporation of additional variables or the engineering of new ones. For example, metrics like batting average with runners in scoring position could provide a more nuanced understanding of a player's clutch performance.
Exploring New Variables: Team on-base percentage (OBP) was also considered, but the results showed little linear correlation with RBIs. This doesn't preclude the possibility of incorporating other meaningful variables in future analyses.
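A quick linear-correlation screen of this kind can be done with numpy. The arrays below are synthetic stand-ins, with team OBP generated independently of the RBI rate to mimic the near-zero correlation observed:

```python
# Screening a candidate variable (here, team OBP) for linear correlation
# with RBIs per plate appearance. Both arrays are synthetic and independent,
# so the coefficient should land near zero.
import numpy as np

rng = np.random.default_rng(2)
team_obp = rng.normal(0.320, 0.015, size=100)    # hypothetical team OBP values
rbi_per_pa = rng.normal(0.115, 0.020, size=100)  # independent of team_obp here

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient
r = np.corrcoef(team_obp, rbi_per_pa)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A coefficient near zero, as was observed for team OBP, argues against including the variable in a linear model, though it does not rule out a nonlinear relationship.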
Ongoing research is encouraged to refine models and explore novel dimensions of player performance. Expanding the scope of analysis and considering diverse factors can deepen the understanding of baseball statistics and their impact on RBIs.
Real-World Application
Strategic Player Acquisition: Baseball teams in need of an offensive boost can leverage this model to make informed decisions when acquiring new players. By inputting a hitter's statistics into the model, teams can project how many RBIs the player is likely to contribute per plate appearance. While RBIs alone don't determine wins, the project's finding that home runs drive run production suggests that struggling teams seeking offensive firepower may prioritize recruiting players with a knack for hitting home runs, potentially rejuvenating their offensive capabilities.
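A minimal sketch of this acquisition use case, assuming illustrative feature names and a hypothetical free agent's numbers (not real data):

```python
# Fitting a model on historical hitters, then projecting RBIs per plate
# appearance for a prospective signing. The training data and the
# candidate's statistics are synthetic stand-ins.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
features = ["HR_per_PA", "SLG", "BB_per_PA", "AVG"]
X_hist = pd.DataFrame(rng.random((300, 4)), columns=features)
y_hist = (0.5 * X_hist["HR_per_PA"] + 0.3 * X_hist["SLG"]
          + 0.03 * rng.standard_normal(300))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_hist, y_hist)

# A hypothetical free-agent hitter's per-PA statistics
candidate = pd.DataFrame([[0.06, 0.520, 0.10, 0.285]], columns=features)
projected = model.predict(candidate)[0]
print(f"Projected RBIs per plate appearance: {projected:.4f}")
```

Passing the candidate as a one-row DataFrame with the training column names avoids scikit-learn's feature-name mismatch warning and keeps the feature order explicit.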